
[WIP] Feature Analytics: Add Data Analyzer for pre-training graph data analysis #591

Open
svij-sc wants to merge 20 commits into main from svij/easy-analyz-bq

svij-sc (Collaborator) commented Apr 17, 2026

Summary

  • Standalone DataAnalyzer module that takes a YAML config pointing at BQ node/edge tables and generates a single self-contained HTML report covering data quality, feature distributions, and graph structure — so engineers can diagnose training data issues in minutes instead of after a failed training run.
  • 4-tier validation: hard fails (dangling edges, referential integrity, duplicate nodes) → core metrics (degree distribution, hubs, cold-start, memory budget, neighbor explosion estimate) → label/heterogeneous (class imbalance, label coverage, edge type distribution) → opt-in advanced (reciprocity, homophily, connected components, clustering).
  • Thresholds and check selection backed by a literature review of 18 production GNN papers (PinSage, LiGNN, TwHIN, GiGL, BLADE, AliGraph, GraphSMOTE, Beyond Homophily, Feature Propagation, and more). Each threshold cites its source paper.
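
For reviewers who want a feel for the Tier 2 "memory budget" and "neighbor explosion estimate" numbers, here is a minimal Python sketch of how such estimates can be computed. The function names, the float32 default, and the no-overlap assumption are illustrative and not taken from the PR's code.

```python
from typing import Sequence


def estimate_neighbor_explosion(fan_out: Sequence[int]) -> int:
    """Upper bound on nodes touched per seed node by a multi-hop sampler.

    Assumes hop k samples exactly fan_out[k] neighbors per frontier node and
    that sampled neighborhoods never overlap (a worst-case assumption).
    """
    total = 1      # the seed node itself
    frontier = 1   # nodes at the current hop
    for k in fan_out:
        frontier *= k
        total += frontier
    return total


def estimate_feature_memory_bytes(
    num_nodes: int, num_features: int, bytes_per_value: int = 4
) -> int:
    """Dense feature-matrix size; 4 bytes per value assumes float32 storage."""
    return num_nodes * num_features * bytes_per_value
```

With `fan_out=[10, 5]` the bound is 1 + 10 + 50 = 61 nodes per seed, which shows how quickly deeper or wider sampling grows the per-batch footprint.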

Changes

  • gigl/analytics/data_analyzer/config.py, types.py, queries.py (18 SQL templates), graph_structure_analyzer.py, feature_profiler.py (stub), data_analyzer.py orchestrator + CLI
  • gigl/analytics/data_analyzer/report/PRD.md, SPEC.md, report_generator.py, and AI-owned report.ai.html, charts.ai.js, styles.ai.css (regenerable from PRD + SPEC)
  • tests/unit/analytics/data_analyzer/ — 26 unit tests covering config parsing, SQL templates, analyzer orchestration, and HTML snapshot
  • tests/test_assets/analytics/sample_analyzer_config.yaml + golden_report.html snapshot
  • docs/plans/ — design doc, literature review, 1-pager, engineering spec (all colocated)
  • pyproject.toml — package-data declaration so .ai.* assets ship in installed wheels

Test plan

  • uv run python -m unittest discover -s tests/unit/analytics -p "*_test.py" -t . → 26/26 pass
  • make type_check → clean on 651 files
  • make check_format → clean
  • Manual: run analyzer CLI against a real BQ dataset and inspect the generated HTML

v1 scope cuts (follow-up PRs)

  • FeatureProfiler: TFDV/Dataflow integration is a working stub that logs a warning and returns empty results. The full Beam pipeline wiring (reusing GenerateAndVisualizeStats, IngestRawFeatures, init_beam_pipeline_options from the existing DataPreprocessor) will land in a follow-up PR.
  • GCS upload: The orchestrator generates the HTML but does not yet upload it; currently returns the target path with a TODO.
  • Tier 4 advanced queries: Reciprocity, homophily, connected components, and clustering coefficient are not implemented. Power-law exponent is computed as a degree-stats approximation.
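
The description does not state which formula the "degree-stats approximation" uses. As one plausible illustration, the sketch below applies the textbook continuous maximum-likelihood estimator, which needs only a count and a sum of log-degrees, both of which are cheap aggregates. The function name and the `d_min` default are assumptions, and the PR's actual computation may differ.

```python
import math
from typing import Iterable, Optional


def powerlaw_exponent_mle(
    degrees: Iterable[float], d_min: float = 1.0
) -> Optional[float]:
    """Continuous MLE for a power-law tail: alpha = 1 + n / sum(ln(d / d_min)).

    Only degrees >= d_min are used. Returns None when the estimate is
    undefined (no qualifying degrees, or all of them equal d_min).
    """
    logs = [math.log(d / d_min) for d in degrees if d >= d_min]
    total = sum(logs)
    if not logs or total == 0.0:
        return None
    return 1.0 + len(logs) / total
```

Because the estimator depends only on `n` and the sum of `ln(d)`, the same value could be derived from a single aggregate query instead of pulling every degree into Python.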

Docs

  • Design doc: docs/plans/20260415-bq-data-analyzer.md
  • Literature review: docs/plans/20260415-bq-data-analyzer-references.md
  • 1-pager: docs/plans/20260416-data-analyzer-1-pager.md
  • Engineering spec: docs/plans/20260416-data-analyzer-engineering-spec.md
  • Report PRD (product intent): gigl/analytics/data_analyzer/report/PRD.md
  • Report SPEC (technical contract): gigl/analytics/data_analyzer/report/SPEC.md

svij-sc and others added 14 commits April 17, 2026 20:25
Co-Authored-By: shubhamvij <svij@snapchat.com>
Co-Authored-By: shubhamvij <shubhamvij@users.noreply.github.com>
Co-Authored-By: shubhamvij <svij@snapchat.com>
…sisResult, FeatureProfileResult)

Co-Authored-By: shubhamvij <svij@snapchat.com>
Implements the orchestration layer for BQ-based graph data quality checks:
- Tier 1 hard-fails (dangling edges, referential integrity, duplicate nodes)
  raise DataQualityError carrying a partially populated result.
- Tier 2 core metrics (counts, degree stats, top-K hubs, INT16 clamp, NULL
  rates) plus Python-side feature memory and neighbor-explosion estimates.
- Tier 3 label/heterogeneous checks auto-enabled by config (label_column
  presence; multiple edge tables).
- Tier 4 opt-in placeholders (power-law exponent from degree stats).

Co-Authored-By: shubhamvij <svij@snapchat.com>
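
A rough sketch of the control flow this commit describes: Tier 1 violations raise an error that still carries the partially populated result, Tier 2 always runs, and Tier 3 is switched on by the config. `AnalysisResult`, `run_tiers`, and the field names are hypothetical stand-ins; only `DataQualityError` is named in the commit message, and whether all Tier 1 checks run before raising is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class AnalysisResult:  # hypothetical stand-in for the PR's result type
    hard_fail_counts: Dict[str, int] = field(default_factory=dict)
    tiers_run: List[int] = field(default_factory=list)


class DataQualityError(Exception):
    """Raised on Tier 1 violations; keeps whatever was computed so far."""

    def __init__(self, message: str, partial_result: AnalysisResult) -> None:
        super().__init__(message)
        self.partial_result = partial_result


def run_tiers(
    hard_fail_checks: Dict[str, Callable[[], int]],
    label_column: Optional[str],
    num_edge_tables: int,
) -> AnalysisResult:
    result = AnalysisResult()

    # Tier 1: run every hard-fail check, then raise if any found violations.
    for name, check in hard_fail_checks.items():
        result.hard_fail_counts[name] = check()
    result.tiers_run.append(1)
    violations = {n: c for n, c in result.hard_fail_counts.items() if c > 0}
    if violations:
        raise DataQualityError(f"Tier 1 checks failed: {violations}", result)

    # Tier 2: core metrics always run (the metric queries are omitted here).
    result.tiers_run.append(2)

    # Tier 3: auto-enabled by a label column or by multiple edge tables.
    if label_column is not None or num_edge_tables > 1:
        result.tiers_run.append(3)
    return result
```

Carrying the partial result on the exception lets a caller still render a report that shows which hard-fail check tripped and by how much.
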
Co-Authored-By: shubhamvij <svij@snapchat.com>
…assets

Co-Authored-By: shubhamvij <svij@snapchat.com>
Implements the report_generator module that stitches AI-owned template,
styles, and chart JS into a single self-contained HTML report by
replacing the four INJECT_* placeholders. Adds a golden-file snapshot
test (and four structural tests) so future AI-driven edits to the
report assets fail fast until the snapshot is regenerated. Registers
the *.ai.{html,js,css} assets as package-data so importlib.resources
can resolve them from an installed wheel.

Co-Authored-By: shubhamvij <svij@snapchat.com>
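
The placeholder-substitution approach can be sketched as follows. The commit only says there are four `INJECT_*` placeholders, so the four token names used here, and the `render_report` signature, are hypothetical. The real module presumably loads the template and assets via `importlib.resources` rather than taking them as strings.

```python
import json
from typing import Any, Mapping


def render_report(
    template: str,
    styles: str,
    charts_js: str,
    data: Mapping[str, Any],
    title: str,
) -> str:
    """Stitch assets into one self-contained HTML string."""
    # Escape "</" so embedded JSON cannot close the surrounding <script> tag.
    data_json = json.dumps(data).replace("</", "<\\/")
    replacements = {  # hypothetical placeholder names
        "INJECT_TITLE": title,
        "INJECT_STYLES": styles,
        "INJECT_DATA": data_json,
        "INJECT_CHARTS_JS": charts_js,
    }
    html = template
    for token, value in replacements.items():
        if token not in html:
            # Fail fast if the template drifted from the expected contract.
            raise ValueError(f"template is missing placeholder {token}")
        html = html.replace(token, value)
    return html
```

Failing on a missing placeholder follows the same idea as the golden-file snapshot test: an edit to the AI-owned template that drops a token should be caught immediately instead of producing a silently broken report.
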
Implements the main orchestrator class that coordinates graph structure
analysis, feature profiling, and HTML report generation. Includes CLI
entry point with argparse for analyzer_config_uri and resource_config_uri.

Co-Authored-By: shubhamvij <svij@snapchat.com>
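
A minimal sketch of what such an argparse entry point might look like. The two argument names come from the commit message; the `--` flag spelling, which flags are required, and the help text are assumptions.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Analyze BQ node/edge tables and emit an HTML report."
    )
    parser.add_argument(
        "--analyzer_config_uri",
        required=True,
        help="URI or path of the analyzer YAML config.",
    )
    parser.add_argument(
        "--resource_config_uri",
        default=None,  # assumed optional in this sketch
        help="URI or path of the resource config.",
    )
    return parser.parse_args(argv)
```

Accepting `argv` as a parameter keeps the parser unit-testable without patching `sys.argv`.
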
…deferred)

Co-Authored-By: shubhamvij <svij@snapchat.com>
Narrows the Union return type for mypy in the direct-merge test path.

Co-Authored-By: shubhamvij <svij@snapchat.com>
Co-Authored-By: shubhamvij <svij@snapchat.com>
Sits alongside SPEC.md to separate product requirements (why and what)
from technical implementation contract (how). Both are AI-owned and
together form the input for regenerating report.ai.html, charts.ai.js,
and styles.ai.css.

Co-Authored-By: shubhamvij <svij@snapchat.com>
svij-sc added 2 commits April 17, 2026 23:51
… 1-pager, engineering spec

Colocates all planning docs for the BQ Data Analyzer feature:
- 20260415-bq-data-analyzer.md: full design doc with 4-tier validation,
  cost control, tradeoff analysis
- 20260415-bq-data-analyzer-references.md: literature review of 18
  production GNN papers with 100+ findings, common themes, and
  consolidated threshold table
- 20260416-data-analyzer-1-pager.md: executive summary for peer
  engineers and decision makers
- 20260416-data-analyzer-engineering-spec.md: per-layer implementation
  plan that the analyzer code in this branch follows

Co-Authored-By: shubhamvij <svij@snapchat.com>
svij-sc changed the title from "feat(analytics): add BQ Data Analyzer for pre-training graph data analysis" to "Feature Analytics: Add Data Analyzer for pre-training graph data analysis" on Apr 18, 2026
svij-sc changed the title from "Feature Analytics: Add Data Analyzer for pre-training graph data analysis" to "[WIP] Feature Analytics: Add Data Analyzer for pre-training graph data analysis" on Apr 18, 2026
svij-sc added 4 commits April 20, 2026 18:01
…trator

Previously the orchestrator generated the HTML in memory but left the
upload as a TODO, forcing practitioners to copy a Python snippet to see
the output. Now DataAnalyzer.run() writes report.html under
config.output_gcs_path, detecting the scheme:

- gs:// URIs upload via GcsUtils.upload_from_string()
- local paths write via pathlib, creating parent dirs as needed

Returns the final path (GCS URI or resolved local path) so the CLI can
log it and practitioners can open the file directly.

Tests cover both local and mocked-GCS paths plus trailing-slash handling.

Co-Authored-By: shubhamvij <svij@snapchat.com>
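
The scheme detection described in this commit can be sketched roughly as below. The upload step is passed in as a callable because the exact signature of `GcsUtils.upload_from_string()` is not shown in this PR page; `write_report` and its parameter names are hypothetical.

```python
from pathlib import Path
from typing import Callable


def write_report(
    html: str,
    output_path: str,
    upload_to_gcs: Callable[[str, str], None],
) -> str:
    """Write report.html under output_path and return its final location.

    upload_to_gcs(uri, contents) stands in for the real GCS helper.
    """
    if output_path.startswith("gs://"):
        # Strip trailing slashes so "gs://b/dir/" and "gs://b/dir" agree.
        uri = output_path.rstrip("/") + "/report.html"
        upload_to_gcs(uri, html)
        return uri

    # Local path: create parent directories as needed, then write the file.
    target = Path(output_path) / "report.html"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(html, encoding="utf-8")
    return str(target.resolve())
```

Injecting the uploader also mirrors how the mocked-GCS test path can be exercised without network access.
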
Quickstart-first guide at gigl/analytics/README.md covering:

- 3-step quickstart (auth, YAML config, CLI command) with a single
  entry point that now writes report.html to disk or GCS
- Tier summary table (what runs when)
- Interpretation table with thresholds + "what to do" actions drawn
  from the 18-paper literature review
- Advanced config keys (opt-in Tier 3/4, label_column, timestamp_column,
  fan_out)
- Python API snippet for programmatic access
- graph_validation sub-package pointer
- Scope and limitations (FeatureProfiler stub, Tier 4 queries TODO)
- Links to design doc, literature review, 1-pager, engineering spec,
  report PRD, and report SPEC

Co-Authored-By: shubhamvij <svij@snapchat.com>
Changes from the review pass:

README fixes:
- Remove all docs/plans/* links (the plans were intentionally deleted
  in d3f1eb8). Inline the relevant paper citations into the threshold
  table so readers aren't pointed at 404s.
- Add "Prerequisites" line pointing at the GiGL installation guide so
  the quickstart doesn't assume uv/deps are already set up.
- Mark Tier 4 flags (compute_homophily, compute_connected_components,
  compute_clustering, timestamp_column) as not-yet-implemented in both
  the tier table and the Advanced Config section, not only in the
  Scope section at the bottom.
- Add the power-law exponent mention to the Tier 4 row (was only in
  scope notes; it's actually computed today).
- Document the heterogeneous-graph referential-integrity caveat
  (analyzer currently joins each edge table against node_tables[0]).
- Link to tests/test_assets/analytics/golden_report.html so a reader
  can preview the output before authenticating to BQ.

Config fix:
- NodeTableSpec.feature_columns: MISSING -> field(default_factory=list)
  so that nodes with no features are legal. Previously users got a
  cryptic OmegaConf MissingMandatoryValue error, and no-feature nodes
  are a real use case.
- Add a regression test covering the no-feature-columns case.

All 31 analytics unit tests pass. mypy clean. check_format clean.

Co-Authored-By: shubhamvij <svij@snapchat.com>
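
The config fix boils down to giving the field a default factory instead of marking it mandatory. The sketch below uses plain dataclasses so it runs without OmegaConf; `bq_table` and `node_id_column` are hypothetical field names, and only `feature_columns` is taken from the commit message.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeTableSpec:  # trimmed, illustrative version of the config class
    bq_table: str        # hypothetical field
    node_id_column: str  # hypothetical field
    # Previously mandatory (OmegaConf MISSING); now defaults to an empty
    # list so that node tables without features are a valid config.
    feature_columns: List[str] = field(default_factory=list)
```

`default_factory=list` is used instead of a literal `[]` default because dataclasses reject mutable defaults, and each instance needs its own list.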